Abstract

An item exhibits differential item functioning (DIF) when it displays different statistical properties
for different groups after the groups have been matched on an ability measure. For instance, with binary
data DIF exists when there is a difference in the conditional probabilities of a correct response for two
manifest groups. This paper argues that the occurrence of DIF can be explained by recognizing that the
observed data do not reflect a homogeneous population of individuals but are a mixture of data from
multiple latent populations or classes. Under this conceptualization, DIF is observed with the current
manifest-group approach only to the degree that the manifest groups are represented in the latent
classes in different proportions. A Monte Carlo study was conducted to compare various approaches to
detecting DIF under this formulation. Results showed that as the latent class proportions became more
equal, the identification rates of the DIF detection methods approached null condition levels.

Current approaches to item bias analysis use differential item functioning (DIF) methods to
identify items that are measuring differently for manifest groups. An item identified as exhibiting
DIF is subjected to a review by a panel of experts (a.k.a. "logical evidence of bias") to determine
whether or not the item is biased. It is the panel's conclusion that determines whether an item
exhibiting DIF is also biased.

DIF may be defined as an item that displays different statistical properties for different groups 
after matching the groups on an ability measure (Angoff, 1993). For instance, assuming dichotomous 
data and an item response theory (IRT) perspective, DIF is defined as a difference in the conditional 
probabilities of a correct response for two manifest groups (e.g., males and females). Graphically this 
may be represented as the difference between two item characteristic curves (ICCs) for the same item 
where one ICC is based on the item parameters estimated from the Reference (R) group response data 
and the other ICC is based on the parameter estimates based on the Focal (F) group response data after 
linking the two sets of item parameter estimates. 

There are two recognized forms of DIF. In the first, the ICCs are parallel; this type is labeled
uniform DIF because one group is favored over the other group across the entire ability (θ) scale.
Nonuniform DIF reflects the fact that members of one manifest group perform better than members of
the other manifest group for a part of the θ scale, whereas this relationship is reversed for a different
part of the scale. Graphically, an item that exhibits nonuniform DIF has the ICC for one manifest
group crossing the ICC for the other manifest group. Moreover, this allows for the possibility that DIF
for one group (positive DIF) may be wholly or partially compensated for by DIF against that group
(negative DIF) at another point along the θ continuum.
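
The distinction between uniform and nonuniform DIF, and the way positive and negative DIF can cancel,
can be illustrated numerically. The sketch below is not one of the procedures used in the study; it
simply integrates the difference between two 2PL ICCs with the trapezoidal rule. The function names
`icc` and `areas`, the integration limits, and the example parameter values are assumptions chosen for
illustration.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve: probability of a correct response at theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def areas(a_ref, b_ref, a_foc, b_foc, lo=-20.0, hi=20.0, steps=8000):
    """Trapezoidal signed and unsigned areas between the Reference and Focal ICCs."""
    h = (hi - lo) / steps
    signed = unsigned = 0.0
    for k in range(steps + 1):
        theta = lo + k * h
        d = icc(theta, a_ref, b_ref) - icc(theta, a_foc, b_foc)
        w = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        signed += w * d * h
        unsigned += w * abs(d) * h
    return signed, unsigned

# Uniform DIF: equal discriminations, difficulties differ by 0.3.
print(areas(1.0, 0.0, 1.0, 0.3))
# Nonuniform DIF: equal difficulties, unequal discriminations (curves cross at theta = 0).
print(areas(1.5, 0.0, 0.7, 0.0))
```

In the uniform case the signed and unsigned areas agree, because one curve lies above the other
everywhere. In the crossing case the signed area is essentially zero while the unsigned area is not,
which is the compensation between positive and negative DIF described above.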

One interpretation of an item having different ICCs across manifest groups is that the item is
measuring ability differently across groups. In effect, the different
"abilities" are influencing the probability of a correct response to an item. As such, DIF may be 
conceptualized as a type of multidimensionality occurring when an item measures multiple dimensions 
and when the manifest groups differ in their relative locations to one another on the nonprimary 
ability(ies). If the two groups do not differ in their relative location on the nonprimary dimension(s), 
then neither group benefits from the nonprimary dimension and DIF does not occur even though the data 
are multidimensional (cf. Ackerman, 1992). 

Current IRT-based DIF analyses create, in essence, two subsamples, albeit with known manifest
characteristics (e.g., one subsample is female, the other is male). When the item parameter estimates
for an item are not invariant across the manifest groups, this difference is interpreted as evidence
of DIF. However, in the context of IRT the absence of invariance of item parameter estimates is
evidence of model-data misfit. For instance, if this misfit had occurred across randomly created
subsamples (i.e., a large calibration sample is randomly split into two subsamples each of which is
separately calibrated), then this would be evidence that the IRT model may not be appropriate for one
or more items. DIF is therefore an example of model-data misfit: regardless of whether the examinee
calibration group is randomly split into two subsamples or split into subsamples based on manifest
characteristics (e.g., male/female), a lack of invariance suggests that the model may be incorrect
for the data at hand. Moreover, items that have been identified as exhibiting DIF (i.e., misfitting
items) may still be retained if logical evidence of bias is not forthcoming.

Rather than interpreting the difference between, for example, an item's difficulty estimate (b̂) for
one manifest group and the item's b̂ for the other manifest group (after linking) as DIF and trying to
explain this DIF with an item bias review, one may view the observed difference as a reflection of
different scales
underlying the data. Specifically, the observed data do not reflect a homogeneous population of 
individuals but are a mixture of data from multiple latent populations or classes (LCs). Within each 
latent class one has quantitative individual differences (Rost, 1991), but the classes are qualitatively 
different. Therefore, within a class there is an ability scale and this scale is wholly or in part 
different than those in other classes. Clearly, there is a multidimensionality in this 
conceptualization, albeit conceptualized differently than in traditional multidimensional item 
response theory (e.g., see Ackerman, 1996; Camilli, 1992; Reckase, 1997). 

The above conceptualization of the observed data says, in short, that there are one or more latent 
classes and within each latent class there is an IRT model. In the simplest case there is only one latent 
class and the calibration sample contains only members from this class and one has model-data fit with 
a simple IRT model. However, when the observed data consists of members from different latent classes 
there is not an IRT model that accurately reflects the data for the entire calibration sample (i.e., there 
is model-data misfit). Rather, there are different item and ability parameters which are conditional 
on the different subpopulations or latent classes. Mixture distribution models such as those of Rost 
(1990) and Mislevy and Verhelst (1990) have addressed this general idea and their extensions of the 
Rasch model have been concerned with solution strategies that differ across subpopulations; also see 
Kelderman and Macready (1990) for a general framework. 

In certain situations examinee samples may actually consist of examinees from different latent 
classes or subpopulations. In the simplest multiclass situation the examinee sample consists of a 
mixture of two latent classes. If the latent classes are functionally equivalent to the manifest groups 
(i.e., 100% of the Reference group are masters, and 100% of the Focal group are nonmasters), then the 
current conceptualization of DIF will correctly identify DIF. However, it is unlikely that the latent 
classes will be equivalent to the manifest groups and the two manifest groups may in fact contain 
members from these two latent classes in different proportions. For example, 80% of the Reference 
manifest group may consist of masters, whereas 80% of the Focal manifest group may be nonmasters. 

Moreover, this conceptualization of DIF hypothesizes that when one observes DIF using the current 
conceptualization of DIF it is only to the degree that the manifest groups are represented in the latent 
classes in different proportions. If the manifest groups were equally represented in, say, the
two-latent-class situation, then it should not be possible to detect DIF using the current IRT-based
strategy and these manifest groups. For instance, suppose the Focal group is defined as black students,
50% of whom come from an impoverished environment (e.g., poor inner city schools) and the balance from
a nonimpoverished environment (e.g., affluent suburban schools). The former environment may create a
situation in which the students would not have the prerequisite (cognitive) skills to be classified as
belonging to, for example, a masters latent class, whereas the latter environment would lead to the
development of the skills that would result in the students being classified in a masters latent class.
A Reference group consisting of white students could be constructed similarly, with the impoverished
environment being rural/Appalachia (50%) and the nonimpoverished environment being the suburbs. A
comparison of black/white students for DIF using standard methods would most likely lead to a false
negative conclusion for DIF for each item. However, to the extent that environment affects the
development of cognitive skills and thereby affects classification into master and nonmaster latent
classes, the items would be performing differentially across classes. That is, the items exhibit DIF,
but not with respect to the black/white manifest groups. If one used school/home environment as the
characteristic for creating the manifest groups, then DIF would be made evident using standard IRT DIF
analysis. Therefore, from this perspective DIF would potentially exist for an item anytime more than
one latent class is required to obtain model-data fit for the item set.
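
The argument above can be made explicit with a short derivation. It is a sketch under two assumptions
that follow the paper's premise but are not stated formally in it: within a latent class v, neither
the matching score x nor the item response depends on manifest group g. The symbol \pi_{v|g} (the
proportion of manifest group g that belongs to class v) is notation introduced here for illustration.

$$P(u_i = 1 \mid x, g) = \sum_{v} P(v \mid x, g)\, P(u_i = 1 \mid x, v), \qquad P(v \mid x, g) = \frac{\pi_{v|g}\, P(x \mid v)}{\sum_{v'} \pi_{v'|g}\, P(x \mid v')}$$

If \pi_{v|R} = \pi_{v|F} for every v, then P(v | x, R) = P(v | x, F), so the two manifest groups have
the same conditional probability of a correct response at every x and no DIF is detectable between
them, even though the item functions differently across classes. When the proportions differ, the
mixture weights differ and the conditional probabilities can separate.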

The various relationships between two latent classes and manifest groups discussed above are 
presented in Figure 1: 

Insert Figure 1 About Here 

In summary, it is believed that the mechanism that gives rise to DIF is best modeled by the
assumption of latent classes within which the items can be scaled. The examinee sample consists of a
mixture of different latent classes within which one may obtain IRT model-data fit. Given this
premise, the three scenarios presented above were modeled: the LCs are functionally equivalent to the
manifest groups (i.e., 100% of the Reference group belongs to one latent class and 100% of the Focal
group belongs to the other latent class), the manifest groups consist of different proportions of the
latent classes, and the manifest groups consist of equal proportions of the latent classes. The study
consisted of two phases. The first phase compared six different methods for assessing DIF: the
likelihood ratio (G²) method of Thissen, Steinberg, and Wainer (1988, 1993), the logistic regression
method (Swaminathan & Rogers, 1990), Lord's Chi Square (Lord, 1980), the Mantel-Haenszel Chi Square
as presented by Holland and Thayer (1988), and the Exact Signed Area and H Statistic approaches (Raju,
1988, 1990). Thissen et al.'s G² approach, Lord's Chi Square (LCS), the Exact Signed Area (ESA), and
the H statistic are all IRT model-based methods. In the second phase of the study, the logistic
regression (LR) and Mantel-Haenszel (MH) methods were used for DIF identification with latent class
membership as the classificatory variable for the equal latent class proportions condition.

Methodology

Factors: A fixed test length of 30 items was used and manifest group sizes were 500 simulees for the
Focal group (N_F = 500) and 2500 for the Reference group (N_R = 2500) (cf. Camilli & Shepard, 1987;
Donoghue, Holland, & Thayer, 1993; Zwick, Donoghue, & Grima, 1993). The factors explicitly
studied were the number of items exhibiting DIF (NI_DIF), the degree to which DIF was expressed by
the DIF items (Δb), and the latent class proportions (π_v). The NI_DIF factor consisted of three levels
(0%, 10%, and 20% of the 30-item test length) and was included to examine the effect of different
degrees of contamination of the conditioning variable on the various DIF approaches. For the DIF
items there were two levels of the degree to which DIF was expressed by these items (Δb = 0.3 and
Δb = 1.0); these values come from Camilli and Shepard (1987). Two latent classes were modeled and the
latent class proportions (π_v) were set at one of three levels: π1 = 0.17/π2 = 0.83, π1 = 0.30/π2 = 0.70,
and π1 = 0.50/π2 = 0.50. These three factors resulted in 15 cells; phase 1's design is shown in Figure 2.
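
The count of 15 cells is not a full 3 x 2 x 3 crossing, because the 0% DIF level has no DIF magnitude
to vary. A minimal enumeration, assuming that reading of the design (the variable names are
hypothetical and Figure 2 remains the authoritative layout), is:

```python
from itertools import product

# Latent class proportions (pi_1, pi_2) studied.
PI_LEVELS = [(0.17, 0.83), (0.30, 0.70), (0.50, 0.50)]

# (number of DIF items, delta_b): one null condition plus 2 x 2 DIF conditions.
DIF_CONDITIONS = [(0, 0.0)] + [(n, d) for n in (3, 6) for d in (0.3, 1.0)]

cells = [(pi, n_dif, delta_b)
         for pi, (n_dif, delta_b) in product(PI_LEVELS, DIF_CONDITIONS)]

print(len(cells))  # 3 proportion levels x 5 DIF conditions
```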

Insert Figure 2 About Here 

The manifest group sizes crossed by the latent class proportions would theoretically produce the 
crosstabulations shown in Figure 3. 

Insert Figure 3 About Here 



Data Generation: Because it was hypothesized that the mechanism underlying DIF can be modeled by
an IRT model within latent classes, the data generation required two sets of item parameters, one for
each latent class v. The IRT model was the two-parameter logistic (2PL) model with item
discrimination fixed at 1.0 (a = 1.0 has been used in previous DIF work, e.g., Camilli & Shepard,
1987). This may be represented as:



$$p(u_i = 1 \mid \theta_v, a_{iv}, b_{iv}) = \frac{\exp\left(a_{iv}(\theta_v - b_{iv})\right)}{1 + \exp\left(a_{iv}(\theta_v - b_{iv})\right)} \qquad (1)$$

where
a_iv: item i discrimination for latent class v
b_iv: item i difficulty for latent class v
θ_v: the examinee's latent ability for latent class v
u_i = 1: correct response to item i

The b_iv were randomly generated from a N(0,1) distribution. For the DIF conditions the DIF items
exhibited DIF as b_i1 = b_i2 + Δb, and for the nonDIF items b_i1 = b_i2. Data were generated for each of
the 30 items by first assigning simulees to a latent class. This involved cumulatively summing the π_v
across the latent classes and then comparing these successive sums to a number randomly generated from
a U[0,1] distribution. The first sum that was greater than the random number indicated the simulee's
class. Second, a bivariate normal N(0,1, ρ = 0.10) distribution was used to randomly generate a pair of
unit normal deviates to serve as the simulee's θs. The simulees' θs on dimension 1 (v = 1) were rescaled
to have a mean of -1.0 and a standard deviation of 1.0 (cf. Zwick, Donoghue, & Grima, 1993), whereas
their dimension 2 θs were not rescaled (mean θ = 0, σ_θ = 1.0). (Typically in DIF studies, the manifest
group members come from populations with different means (e.g., Camilli & Shepard, 1987). However,
given the study's premise it is the average ability in the classes which needs to vary, and manifest
group ability means are determined by the mixture of latent class membership.) The simulee's
probability of a correct response was calculated according to (1) using the appropriate parameters. If
the probability was greater than a number randomly generated from a U[0,1] distribution, then the item
response was coded 1 (correct), 0 otherwise. This process was repeated for all 30 items in a data set
and for all simulees. Assignment of simulees to Focal and Reference groups was done to match the
specifications provided in Figure 1 with the constraints that N_F = 500 and N_R = 2500. Fifty data sets
were generated in this way for each cell.
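
The generation steps above can be sketched as follows. This is an illustration under stated
assumptions, not the program used in the study: all function names are hypothetical, the DIF items are
taken to be the first items in the set, latent class 1 is index 0, and the assignment of simulees to
Focal and Reference groups is omitted because it depends on the figure specifications.

```python
import math
import random

def make_difficulties(n_items, n_dif, delta_b, rng):
    """Class-2 difficulties are N(0,1) draws; class-1 adds delta_b on the first n_dif items."""
    b2 = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    b1 = [b + delta_b if i < n_dif else b for i, b in enumerate(b2)]
    return [b1, b2]

def assign_class(pis, rng):
    """Return the index of the first cumulative proportion that exceeds a U[0,1] draw."""
    u = rng.random()
    cumulative = 0.0
    for v, pi_v in enumerate(pis):
        cumulative += pi_v
        if cumulative > u:
            return v
    return len(pis) - 1  # guard against floating-point round-off in the final sum

def draw_theta(v, rng, rho=0.10, mean_class1=-1.0):
    """Draw a correlated pair of unit normal deviates and return the one for class v."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    theta1 = z1 + mean_class1                             # dimension 1: mean -1, SD 1
    theta2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2    # dimension 2: mean 0, SD 1
    return (theta1, theta2)[v]

def p_correct(theta, b, a=1.0):
    """Equation (1) with discrimination fixed at a."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate(n_simulees, pis, difficulties, seed=0):
    """Return a list of (latent class index, 0/1 response vector) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_simulees):
        v = assign_class(pis, rng)
        theta = draw_theta(v, rng)
        responses = [1 if p_correct(theta, b) > rng.random() else 0
                     for b in difficulties[v]]
        rows.append((v, responses))
    return rows

rng = random.Random(1)
difficulties = make_difficulties(n_items=30, n_dif=3, delta_b=0.3, rng=rng)
data = simulate(3000, [0.50, 0.50], difficulties, seed=2)
```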

DIF Assessment: Four IRT-based approaches (the G², LCS, ESA, and H statistic measures) and two non-IRT
methods (LR, MH) were used. G² compares a model which assumes a common ICC for both R and F
groups with a second model that has all the parameters of the former model plus a group membership
parameter. G² assesses whether the group parameter is necessary, and one test is conducted for each
item in the item set. LR is similar to the G² approach. For LR a set of nested models is created in
which the criterion is the item response and the predictors are an ability measure (e.g., the number of
correctly answered items, NC), group membership, and the interaction of these two predictors. The
simplest of the three models contains only the ability measure, whereas the most complex contains all
three predictors; the intermediate model does not contain the interaction term. The three-predictor
model is compared to the two-predictor model to see if the interaction term is necessary. Contingent
upon the outcome of this test, the two-predictor model is compared to the one-predictor model to
determine the necessity of the membership predictor. LCS uses a χ² to compare the item parameter
estimates based on the Reference group data to those obtained using the Focal group data and takes into
account the sample variance/covariance of an item's parameter estimates. MH also uses a χ² test to
detect DIF; however, it does not assume the validity of an IRT model. The MH χ² test is used to
determine if responses to an item are independent of group membership after conditioning on a matching
variable, such as NC. ESA (signed area) and H (unsigned area) are based on the premise that if the
area between the ICC based on the Reference group item parameter estimate(s) and the ICC based on
the Focal group item parameter estimate(s) is zero, then the item is functioning "identically" across
the two groups. A z-test is used to determine whether any observed nonzero difference for an item is
due to randomness or something systematic, such as group membership. Like LCS, these area measures use
the sample variance/covariance matrix of the item parameter estimates.
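
As an illustration of the MH test described above, the sketch below computes the continuity-corrected
Mantel-Haenszel chi-square and the common odds ratio estimate from 2 x 2 tables formed at each level
of the matching variable. It follows the standard textbook form of the statistic; it is not the
program written for this study, and the function name, the "R"/"F" group coding, and the toy data are
assumptions.

```python
from collections import defaultdict

def mantel_haenszel(responses, groups, scores):
    """Return (MH chi-square with continuity correction, common odds ratio estimate).

    responses: 0/1 responses to the studied item
    groups:    "R" (Reference) or "F" (Focal) for each examinee
    scores:    matching variable, e.g. number-correct score
    """
    # Per stratum: [R correct, R incorrect, F correct, F incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for u, g, s in zip(responses, groups, scores):
        tables[s][(0 if u == 1 else 1) + (0 if g == "R" else 2)] += 1

    sum_a = sum_expected = sum_var = num = den = 0.0
    for a, b, c, d in tables.values():
        t = a + b + c + d
        n_ref, n_foc = a + b, c + d
        m_correct, m_incorrect = a + c, b + d
        if t < 2 or n_ref == 0 or n_foc == 0:
            continue  # stratum carries no information about group differences
        sum_a += a
        sum_expected += n_ref * m_correct / t
        sum_var += n_ref * n_foc * m_correct * m_incorrect / (t * t * (t - 1))
        num += a * d / t
        den += b * c / t

    chi2 = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var if sum_var > 0 else float("nan")
    alpha = num / den if den > 0 else float("nan")
    return chi2, alpha

# Toy single-stratum example: 30 of 40 Reference and 20 of 40 Focal examinees answer correctly.
resp = [1] * 30 + [0] * 10 + [1] * 20 + [0] * 20
grp = ["R"] * 40 + ["F"] * 40
chi2, alpha = mantel_haenszel(resp, grp, [5] * 80)
```

The chi-square is referred to a chi-square distribution with one degree of freedom; an odds ratio
above 1 indicates that, after matching, the item favors the Reference group.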

To compute G², the procedure outlined by Thissen, Steinberg, and Wainer (1988, 1993) was
followed and MULTILOG (Thissen, 1991) was used for calibration of each of the 50 data sets according
to the 2PL model. In a similar fashion, the LR analysis was performed according to Swaminathan and
Rogers (1990). A program was written to obtain the MH values for the items. For both MH and LR the
number of items correctly answered was used as the ability measure. For LCS, ESA, and H each of the
50 data sets was first decomposed into Reference and Focal groups. Then each group's response data
were separately calibrated using the 2PL model; estimates were obtained via BILOG (Mislevy & Bock,
1990). Stocking and Lord's test characteristic curve method (1983), as implemented in EQUATE (Baker,
1993), was used to link the item parameter estimates from the Focal group to those of the Reference
group. IRTDIF (Kim & Cohen, 1992) was used to obtain the LCS, ESA, and H values.
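
The nested-model comparison used in the LR procedure can be sketched with a small self-contained
logistic regression. This is an illustration only: the study followed Swaminathan and Rogers (1990)
with its own software, whereas the code below fits each model by plain Newton-Raphson and returns the
two likelihood-ratio statistics. The function names, the fixed iteration count, and the 0/1 group
coding are assumptions, and the fit is not protected against separated or degenerate data.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def max_loglik(design, u, iterations=25):
    """Maximized log-likelihood of a logistic regression fitted by Newton-Raphson."""
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, ui in zip(design, u):
            eta = sum(bj * xj for bj, xj in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for j in range(k):
                grad[j] += (ui - p) * x[j]
                for l in range(k):
                    info[j][l] += p * (1.0 - p) * x[j] * x[l]
        beta = [bj + sj for bj, sj in zip(beta, solve(info, grad))]
    loglik = 0.0
    for x, ui in zip(design, u):
        eta = sum(bj * xj for bj, xj in zip(beta, x))
        loglik += ui * eta - math.log1p(math.exp(eta))
    return loglik

def lr_dif(u, score, group):
    """Return (G2 for the interaction term, G2 for the group term)."""
    m1 = [[1.0, s] for s in score]                               # ability only
    m2 = [[1.0, s, g] for s, g in zip(score, group)]             # + group membership
    m3 = [[1.0, s, g, s * g] for s, g in zip(score, group)]      # + interaction
    ll1, ll2, ll3 = max_loglik(m1, u), max_loglik(m2, u), max_loglik(m3, u)
    return 2.0 * (ll3 - ll2), 2.0 * (ll2 - ll1)
```

Each statistic is compared with a chi-square distribution with one degree of freedom (3.84 at the
.05 level, used here as an assumed criterion). A significant interaction statistic points to
nonuniform DIF; otherwise a significant group statistic points to uniform DIF.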

Each of the six DIF methods was applied to all 30 items in each data set and the analyses were
repeated for each cell in the design. The number of times an item was identified as exhibiting DIF was
recorded.

Results 
Phase 1: 

Table 1 shows the null condition. Across items, the error rates ranged from 3.7% to approximately
8.4% for the various DIF methods. As would be expected, there do not appear to be any differences in
the pattern of false positives across levels of π_v or across items. LR and MH showed the greatest
agreement in their pattern of false positives across all π_v conditions (π1 = 0.17, π2 = 0.83: r = 0.846;
π1 = 0.30, π2 = 0.70: r = 0.942; π1 = 0.50, π2 = 0.50: r = 0.911).

Insert Table 1 About Here 



Table 2 shows the results for the 'moderate' DIF condition (Δb = 0.3); the first three items are the
DIF items. The 'π1 = 0.17, π2 = 0.83' condition models the situation in which all manifest group
members belong to only one latent class. The LR and MH approaches correctly identified the DIF items
60.7% and 56% of the time, respectively, with false positive rates of approximately 5% or less. As the
mixture of latent class membership within manifest group became progressively more equal, the LR
and MH approaches' correct identification rates decreased to null condition levels. LCS, H, ESA, and
G² all had identification rates of 49% or less, although their false positive rates were similar to
those of LR and MH.

Insert Table 2 About Here 

The 'high' DIF condition for the three-item level (Table 3) showed high correct identification
rates in the 'π1 = 0.17, π2 = 0.83' level for all methods. G², MH, and LR correctly identified the DIF
items in all 50 replications, although the false positive rates were larger than they were with the
Δb = 0.3 level. Unlike the pattern in the moderate DIF (NI_DIF = 3) condition, the progression from
'π1 = 0.17, π2 = 0.83' to 'π1 = 0.30, π2 = 0.70' had only a moderate effect on the correct identification
rates of the methods, although both ESA and H were more affected than were LCS, G², MH, and LR.
However, for the 'π1 = 0.50, π2 = 0.50' condition all of the methods had null condition identification
rates. While all methods showed high intercorrelations (i.e., agreements in their patterns of correct
identifications and false positives) in the 'π1 = 0.17, π2 = 0.83' condition, in the 'π1 = 0.50,
π2 = 0.50' condition only LR and MH were highly intercorrelated (r = 0.911), with the next highest
correlation (r = 0.866) existing between MH and G²; LCS and G² had r = 0.757.

Insert Table 3 About Here 

The effect of increased contamination of the matching variable due to the number of DIF items is
shown in Tables 4 and 5. While the overall pattern of decreasing correct identification rates as π1
approached π2 found in Table 2 exists in Table 4, a comparison of the corresponding correct
identification rates shows that they are lower for the 'π1 = 0.17, π2 = 0.83' and 'π1 = 0.30,
π2 = 0.70' levels in the NI_DIF = 6 level than in the NI_DIF = 3 level. As was the case in Table 3, for
the 'π1 = 0.50, π2 = 0.50' condition only LR and MH were highly intercorrelated (r = 0.899), with the
next highest correlation existing between LCS and G² (r = 0.878).

Insert Table 4 About Here 

Similar to Table 3, Table 5 shows the same overall pattern for all levels of π_v, although the false
positive rates are larger in the NI_DIF = 6 condition than in the NI_DIF = 3 condition. The correct
identification rates of ESA and H appeared to be more adversely affected by the increase in the number
of items exhibiting DIF than were the other measures. Moreover, for the 'π1 = 0.50, π2 = 0.50' condition
the correlation between LR and MH (r = 0.921) was largest, with the correlation between LCS and G²
(r = 0.746) next.

Insert Table 5 About Here 



Phase 2: 

The above analyses used the manifest groups for identification of the DIF items. In those conditions
where the manifest groups matched the latent class structure, the six methods were, in general, able to
identify the DIF items with correct classification rates that were substantially higher than their
false positive rates. Overall, MH and LR had consistently higher correct identification rates than did
LCS, ESA, and H; G² had rates that were very close to, and in one instance better than, those of MH
and LR.

Phase 2 involved using the simulees' latent class membership in lieu of manifest group membership
with LR and MH. LR and MH were selected because they do not assume that an IRT model predicts the
data and, in general, they performed better than or similar to more computationally intensive methods
such as G². Because the most problematic condition for all methods was 'π1 = 0.50, π2 = 0.50,' this was
the condition used in phase 2. (It is assumed, given phase 1 results, that the use of latent classes in
lieu of manifest group membership for the 'π1 = 0.17, π2 = 0.83' and 'π1 = 0.30, π2 = 0.70' conditions
will be no worse than for 'π1 = 0.50, π2 = 0.50'.) Furthermore, given the predictable pattern observed
in Tables 2-5, only three of the possible 5 cells were analyzed: Δb = 0.0, Δb = 0.3/NI_DIF = 3, and
Δb = 1.0/NI_DIF = 6. The results are presented in Table 6. Comparing the false positive rates for the
Δb = 0.0 level with MH and LR's rates in Table 1 shows that the rates are higher in Table 6. In
contrast, the false positive rates for Δb = 0.3 were comparable to those in Table 2 for MH and LR,
while the correct identification rates were substantially higher. Examination of the Δb = 1.0 level
showed that MH and LR were able to correctly identify the DIF items 100% of the time; however, their
false positive rates were 0.420 and 0.480, respectively. These rates were substantially higher than the
false positive rates for all DIF methods regardless of condition. Examination of the average MH D-DIF
(i.e., -2.35 × log odds ratio; see Dorans & Holland, 1993, p. 41) for items 7-30 (the false positives)
in the 'π1 = 0.50, π2 = 0.50' condition showed that the degree of DIF would not necessarily be
considered important according to the classification criteria used at ETS.
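
For reference, the MH D-DIF index mentioned above is a rescaling of the Mantel-Haenszel common odds
ratio estimate. Writing \hat{\alpha}_{MH} for that estimate (notation assumed here for illustration;
the text above gives only the verbal form), and using the natural logarithm:

$$\text{MH D-DIF} = -2.35\,\ln\left(\hat{\alpha}_{MH}\right)$$

As a worked illustration with a made-up value, an odds ratio estimate of 2 gives
-2.35 × ln 2 = -2.35 × 0.693 ≈ -1.63, whereas an estimate of 1 (no difference in the odds of a correct
response after matching) gives an index of 0.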



Insert Table 6 About Here 

Discussion

Overall, the MH and LR methods had the highest correct identification rates of the six methods
compared and were highly linearly related. The G² method tended to do better than LCS, ESA, and H,
particularly in the 'high' degree of DIF conditions. The MH and LR methods may be preferable to the
other methods examined because they do not assume an IRT model. Moreover, the G² method requires
as many calibrations as the number of items plus one, whereas LCS, ESA, and H required linking the
item parameter estimates for the two manifest groups. While MH examines items for uniform DIF, LR
allows for an examination of both uniform and nonuniform DIF.

The 'π1 = 0.17, π2 = 0.83' level models, according to the study's premise, the mechanism by which
DIF exists and can be identified using manifest groups. As π1 approached π2, all methods based on
manifest groups had greater difficulty in identifying the DIF items. The pattern of decreasing correct
identification rates as π1 approached π2 found above was expected.

If one conducts a DIF analysis of an item set with respect to gender or race as the manifest groups 
and determines that none of the items exhibit DIF, then this conclusion of no DIF is with respect only to 
the manifest groups of gender and race. Such an analysis does not allow one to conclude that there is no 
DIF in the item set. For instance, 2918 subjects with demographic information were sampled from the
data from the National Education Longitudinal Study of 1988 (NELS:88) and used for a DIF analysis.
The "test" consisted of twenty base year math items that spanned the difficulty (b) range in the math
item pool. A DIF analysis using both MH and LR was performed with these data. Using only
examinees who responded that they were "white," two different manifest variables were created.
The first was a Residence variable (urban vs suburban) and the second was a School Type variable
(public vs private; the latter category included both religious and nonreligious schools). Using
Residence as the manifest grouping variable, two items were identified as exhibiting DIF by MH and
two items were identified by LR, but only one of these was identified by both LR and MH. According to
Zieky (1993) all three items would be classified as type A and may still be used for test construction.

The LR and MH analyses using School Type as the grouping variable identified the same three items as
exhibiting DIF. Two of these would be classified as type A and one as type C (MH D-DIF = 2.37). In
this latter case, the type C item would be considered problematic using ETS standards.

In short, the fact that an item set does not exhibit DIF with respect to race, gender, or ethnicity
does not mean that there is no DIF within the item set. Other socially identifiable groups may wish to
be considered for inclusion as part of routine DIF analyses. The selection of manifest grouping
variables is thus based on political, not psychometric, considerations.

The above NELS results should not be interpreted as indicating that the math component of NELS
contains DIF items because these DIF analyses were performed on a subset (approximately 25%) of the
total item pool as well as a subsample of examinees. Conceivably, the use of this smaller item set may
affect the conditioning variable with certain DIF methods and therefore the DIF analysis. An item
identified as exhibiting DIF with the 20-item set may not exhibit DIF when used with the entire item
pool or a different item set. (See Donoghue, Holland, & Thayer (1993) for the conditions under which
MH D-DIF is not influenced by the number of items in the conditioning variable score.)

Ignoring the issue of self-identification of manifest group membership, a third consideration is the
de facto (perhaps innocuous) assumption that individuals within a manifest group are more similar to
one another than they are to members of the other manifest group. For instance, there may be a
subgroup of the Focal group members that is disadvantaged by one or more items (i.e., in comparison to
Reference group members), but the balance of the Focal group is not disadvantaged. In this case, the
subgroup is not at all like the majority of the Focal group; however, the relative sizes of the Focal
subgroup and the majority of the Focal group may result in the masking of DIF in these items.

Finally, a fourth issue in the current approach to DIF analysis using manifest groups is that
deleting an item because of DIF may have unintended consequences for manifest groups that were not the
focus of the analysis (Dorans & Holland, 1993). For example, removing an item that was flagged for
positive gender DIF could lower females' scores and raise males' scores while simultaneously increasing
the scores of Hispanic and Asian-American females (Dorans & Holland, 1993). Therefore,
reformulating DIF in terms of a mixed latent trait-class structure not only allows one to model the
current approach to DIF, it also minimizes or eliminates some of the above-mentioned issues.

Table 1: Number of Times Each Item Was Identified as Exhibiting DIF (Δb = 0: False Positives; N = 50)
and the Intercorrelations of the DIF Methods